Inference Acceleration for Large Language Models on CPUs

Ditto PS, Jithin VG, Adarsh MS

arXiv.org Artificial Intelligence

In recent years, large language models have demonstrated remarkable performance across various natural language processing (NLP) tasks. However, deploying these models for real-world applications often requires efficient inference solutions to handle their computational demands. In this paper, we explore the use of CPUs for accelerating the inference of large language models. Specifically, we introduce a parallelized approach that enhances throughput by 1) exploiting the parallel processing capabilities of modern CPU architectures, and 2) batching inference requests. Our evaluation shows that the accelerated inference engine yields an 18-22x improvement in generated tokens per second, and the gains grow with longer sequences and larger models. In addition, we can run multiple workers on the same machine with NUMA node isolation to further improve tokens per second; as Table 2 shows, we obtained an additional 4x improvement with 4 workers. This would also make Gen-AI based products and companies more environmentally friendly: our estimates show that using CPUs for inference could reduce the power consumption of LLMs by 48.9% while providing production-ready throughput and latency.
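The request-batching idea the abstract describes can be sketched in a few lines: a server drains pending prompts from a queue into batches of a fixed maximum size, so the model processes several requests in one forward pass instead of one at a time. This is a minimal illustrative sketch, not the paper's engine; `collect_batch` and `run_model` are hypothetical names, and `run_model` is a placeholder for the actual batched inference call.

```python
# Minimal sketch of request batching for an inference worker (illustrative;
# not the paper's implementation).
from queue import Queue, Empty

def collect_batch(pending: Queue, max_batch: int = 8) -> list:
    """Drain up to max_batch queued prompts into one batch."""
    batch = []
    while len(batch) < max_batch:
        try:
            batch.append(pending.get_nowait())
        except Empty:
            break  # queue drained; serve whatever we have
    return batch

def run_model(prompts: list) -> list:
    # Hypothetical placeholder: a real engine would run one batched
    # forward pass over all prompts here.
    return [f"<generated for: {p}>" for p in prompts]

# Enqueue ten requests, then serve them in batches of at most 8.
q = Queue()
for i in range(10):
    q.put(f"prompt-{i}")

batch_sizes = []
while not q.empty():
    batch_sizes.append(len(run_model(collect_batch(q))))

print(batch_sizes)  # -> [8, 2]: one full batch, then the remainder
```

Amortizing the per-step overhead across a batch is what lifts tokens/s on CPUs; the paper's further step of pinning one such worker per NUMA node keeps each worker's memory traffic local to its socket.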


The Future of AI: Everywhere, on the Edge, Transforming Our World

#artificialintelligence

The rapid advances in artificial intelligence (AI), as demonstrated by the recent launch of GPT-4 and previously by ChatGPT, are generating a great deal of excitement. Artificial intelligence continues to evolve, offering new possibilities across industries and aspects of human existence and fuelling numerous debates about its potential impact on our everyday lives and the global economy. The C-suites of large organisations in different sectors are actively discussing whether and how such models may be deployed within their organisations, whilst at the same time end users have been adopting the models rapidly. However, Large Language Models (LLMs) built on Transformers with the self-attention mechanism are not the only area of AI that is advancing rapidly. Alongside the vast potential of LLMs and the Transformer-based approach that underlies them is the rise of AI on the Edge (of the network), across the devices that we interact with in our daily lives.


AI analytics & Edge compute just accelerated, now what will innovators do with it?

#artificialintelligence

Do not take the Intel portfolio for granted. Sure, Intel products are present everywhere in our digitalised world, but this company is far more than silicon, hardware, and software. Not long ago, Intel introduced customisable silicon (such a win for its customers) and rapid-deployment options like Intel Select Solutions, pre-verified configurations of hardware and software. Now the conversation has turned to the built-in AI acceleration on the newest 3rd Gen Intel Xeon Scalable processors: quite the AI-infused, data-intensive digital solution.


Artificial Intelligence Gets A Boost With The Latest Generation Intel Xeon Scalable Processors That Drive Inference At Scale

#artificialintelligence

Data scientists also demand increased flexibility: hardware that allows them to program in mainstream languages at a higher level of abstraction, supported by libraries. The data science community is looking for a complete solution stack that abstracts away the hardware specifics, giving them an easier way to crunch parallel workloads more efficiently.